```r
library(obisindicators)
library(dplyr)
```
OBIS on its own is small enough to analyse on a decent laptop or a small cluster, and we have previously visualized ES50 for OBIS data by decade.
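ES50 here refers to Hurlbert's index: the expected number of species in a random draw of 50 occurrence records from a cell. As a minimal sketch of that calculation (not the obisindicators implementation), given per-species record counts:

```r
# Hurlbert's ES50: expected number of species in a random sample of
# n = 50 records drawn from a cell with N total records.
# `counts` is a vector of per-species record counts in that cell.
es50 <- function(counts, n = 50) {
  N <- sum(counts)
  if (N < n) return(NA_real_)  # not enough records to sample 50
  # lchoose() keeps the binomial terms numerically stable for large N;
  # each term is the probability a species appears at least once.
  sum(1 - exp(lchoose(N - counts, n) - lchoose(N, n)))
}

es50(c(100, 50, 25, 10, 5))
```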
GBIF, however, holds an order of magnitude more records than OBIS, so combining the two takes the analysis to another level. Options for tackling this volume of data include increasing the memory available to your cluster or splitting the analysis into more manageable chunks. Here we demonstrate the latter.
I had to make the analysis work on one of three available computing resources.
The combined data are ~185 GB, representing 2.4 billion records and XX columns. The parquet file format helps, but the workflow generally chokes when trying to calculate the diversity metrics because the full dataset does not fit in memory at once. As the dask documentation points out, sometimes you need to chunk up your data.
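One way to chunk the work in R is to open the parquet files lazily with arrow and pull only one slice at a time into memory. A minimal sketch, assuming a parquet directory `combined_parquet/` and a `date_year` column (both hypothetical; adjust to your schema):

```r
library(arrow)
library(dplyr)

# Open the OBIS + GBIF parquet files lazily; nothing is read yet.
ds <- open_dataset("combined_parquet/")

# Process one manageable chunk at a time (here: by decade), so only
# a fraction of the 2.4 billion records is ever in memory at once.
decades <- seq(1950, 2020, by = 10)
results <- lapply(decades, function(d) {
  ds |>
    filter(date_year >= d, date_year < d + 10) |>
    select(decimalLongitude, decimalLatitude, species) |>
    collect() |>              # materialize just this chunk
    mutate(decade = d)
    # ... compute the diversity metrics on this chunk here ...
})

# Stack the per-chunk summaries back into one data frame.
combined <- bind_rows(results)
```

Chunking by decade mirrors the earlier per-decade ES50 maps; any column with roughly even cardinality (year, spatial cell) would serve the same purpose.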
I tried using dask and spark in addition to R arrow, but stopped putting effort into those methods: I was less comfortable with Python and Java than with R, and it became apparent that, no matter what, I would have to process the data in chunks to compute the global dataset on the resources I had access to. The R package sparklyr required installing a Spark environment, something I couldn't do without IT intervention, so I didn't pursue it. Similarly, while it was easy to spin up a dask cluster on AWS, I couldn't easily do that on our HPC, so I didn't pursue that either. Going forward, though, both options could probably accomplish the same thing as this R workflow.
We created maps with both continuous and binned color schemes. With the continuous scheme it was difficult to differentiate the middle values. Our conclusion was that precise values were less useful than being able to quickly see where biodiversity was extremely low or extremely high, and what the approximate "normal" was for various regions.
| Continuous Color | Discrete Color |
|---|---|
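In ggplot2, the difference between the two schemes is a one-line change of fill scale. A minimal sketch, assuming an sf object `grid` with an `es50` column (both names are placeholders):

```r
library(ggplot2)

# Continuous fill: mid-range ES50 values tend to blur together.
p_continuous <- ggplot(grid) +
  geom_sf(aes(fill = es50), color = NA) +
  scale_fill_viridis_c()

# Binned fill: discrete classes make low, "normal", and high
# biodiversity regions easier to tell apart at a glance.
p_binned <- ggplot(grid) +
  geom_sf(aes(fill = es50), color = NA) +
  scale_fill_viridis_b(n.breaks = 6)
```

`scale_fill_viridis_b()` (ggplot2 >= 3.3) bins a continuous variable before mapping it to the viridis palette, which is what produces the discrete legend in the right-hand map.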